[Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5 and Wan2.2 by lishunyang12 · Pull Request #2924 · vllm-project/vllm-omni

lishunyang12 · 2026-04-19T19:08:10Z

Purpose

Phase 1 of #2709 — extends ModelOpt FP8 support to video-gen models. #2913 covers Phase 1 for image-gen (Flux, Flux2-Klein, Qwen-Image, HunyuanImage-3); this PR adds the video-gen counterpart for both HunyuanVideo-1.5 and Wan2.2 TI2V-5B using the same loader infrastructure.

Builds on:

#2913 (Phase 1, image-gen): ModelOpt FP8 checkpoint auto-detect + diffusers loader adapter (cherry-picked into this branch; rebases away when #2913 merges)
#2920 (referenced, will not merge): DiT quant_config wiring for HV-1.5 + Wan2.2 (extracted into this PR; #2920 stays as online-FP8 ablation reference)

Changes

DiT wiring (extracted from #2920)

hunyuan_video_15_transformer.py + pipelines — HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, HunyuanVideo15Transformer3DModel accept quant_config / prefix; threaded to to_qkv, to_out[0], add_kv_proj, to_add_out, ff, ff_context.
wan2_2_transformer.py + wan2_2_vace_transformer.py + 4 pipelines — WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, WanTransformer3DModel, VACE variant. Factories (create_transformer_from_config, create_vace_transformer_from_config) accept optional quant_config.
Modulation (raw nn.Linear / scale_shift_table), patch embedders (Conv3d), time/text/image embedders, proj_out, and the HV-1.5 token refiner stay full precision.
The aggressive skip patterns from #2920 (attn1/attn2 quant_config=None on Wan2.2) are not applied here: they were an online-FP8 workaround, and static calibration makes them unnecessary.
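The wiring pattern in this list can be sketched in plain Python. This is a toy stand-in and not the vllm-omni code: `FakeQuantConfig`, `ToyLinear`, `ToyAttention` and `ToyBlock` are hypothetical names, and the `transformer_blocks` prefix is assumed for illustration. Only the layer names (to_qkv, to_out[0], ff) and the rule that modulation stays full precision come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeQuantConfig:
    """Hypothetical stand-in for a quantization config object."""
    method: str = "modelopt_fp8"


@dataclass
class ToyLinear:
    """Records its fully qualified name and whether it is quantized."""
    prefix: str
    quant_config: Optional[FakeQuantConfig] = None

    @property
    def quantized(self) -> bool:
        return self.quant_config is not None


class ToyAttention:
    def __init__(self, quant_config=None, prefix=""):
        # Quantizable projections get quant_config plus a dotted prefix so
        # per-layer scales can be matched by name at load time.
        self.to_qkv = ToyLinear(f"{prefix}.to_qkv", quant_config)
        self.to_out = ToyLinear(f"{prefix}.to_out.0", quant_config)


class ToyBlock:
    def __init__(self, index, quant_config=None, prefix="transformer_blocks"):
        block_prefix = f"{prefix}.{index}"
        self.attn = ToyAttention(quant_config, f"{block_prefix}.attn")
        self.ff = ToyLinear(f"{block_prefix}.ff", quant_config)
        # Modulation stays full precision: no quant_config is passed down.
        self.modulation = ToyLinear(f"{block_prefix}.modulation", None)


def layers(block):
    """Flat list of the toy block's linear layers."""
    return [block.attn.to_qkv, block.attn.to_out, block.ff, block.modulation]


for layer in layers(ToyBlock(0, FakeQuantConfig())):
    print(layer.prefix, "quantized" if layer.quantized else "full precision")
```

The point of passing `prefix` alongside `quant_config` is that each quantized layer ends up with a unique dotted name, which is what a checkpoint adapter would need to bind per-layer scales.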

ModelOpt FP8 helpers

examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py — HV-1.5 calibrator. Force-exports FP8 weights, patches quant_algo: FP8, hides quantizers during save. MHA quantizers off by default.
examples/quantization/quantize_wan2_2_modelopt_fp8.py — Wan2.2 TI2V-5B calibrator. Same design.
examples/quantization/check_modelopt_fp8_export.py — verifier. Reads safetensors header dtypes, checks quant_algo: FP8, classifies scale granularity (per-tensor / per-channel / per-block).
vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml + wan2_2_ti2v_dit_fp8.yaml — serving stage configs with auto-detect.
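As a rough illustration of what "reads safetensors header dtypes" can look like, the sketch below parses the header of a `.safetensors` file directly. It relies only on the published file layout (an 8-byte little-endian header length followed by a JSON header); the function names are made up here and this is not the code in check_modelopt_fp8_export.py.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def count_dtypes(header):
    """Count tensors per on-disk dtype string, e.g. 'F8_E4M3' or 'BF16'."""
    return Counter(entry["dtype"] for entry in header.values())
```

Calling `count_dtypes(read_safetensors_header(path))` on a transformer shard (the path is whatever file the calibrator wrote) reports how many tensors are stored as `F8_E4M3`, independent of how a tensor library chooses to view them in memory.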

Adapter (this PR also fixes a general-purpose bug in #2913's adapter):

modelopt_fp8.py:_get_weights_mapper now walks submodules to aggregate hf_to_vllm_mapper from whichever sub-module defines it. The adapter is instantiated with the whole Pipeline, so model-specific remaps (like Wan2.2's ffn.net.0. → ffn.net_0.) must be discovered on the transformer submodule, not the top-level Pipeline. Fixes silent-noise output that occurred on Wan2.2 ModelOpt FP8 before this change.
WanTransformer3DModel.hf_to_vllm_mapper added with that remap.
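The submodule walk can be illustrated with a torch-free toy. `ToyModule`, `ToyWanTransformer` and `collect_substr_remaps` are hypothetical, and the mapper is simplified to a plain dict of substring remaps (in vLLM it is a mapper object); only the `ffn.net.0.` to `ffn.net_0.` remap is taken from the description.

```python
class ToyModule:
    """Minimal stand-in for nn.Module with a named_modules() walk."""

    hf_to_vllm_mapper = None  # subclasses may define a dict of substring remaps

    def __init__(self, **children):
        self.children = children

    def named_modules(self, prefix=""):
        yield prefix, self
        for name, child in self.children.items():
            child_prefix = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(child_prefix)


class ToyWanTransformer(ToyModule):
    # Remap quoted in the PR: diffusers "ffn.net.0." -> vllm-omni "ffn.net_0."
    hf_to_vllm_mapper = {"ffn.net.0.": "ffn.net_0."}


def collect_substr_remaps(root):
    """Merge remaps from every module in the tree, not just the root."""
    merged = {}
    for _, module in root.named_modules():
        mapper = getattr(module, "hf_to_vllm_mapper", None)
        if mapper:
            merged.update(mapper)
    return merged


# The adapter sees the whole pipeline; the remap lives on the transformer.
pipeline = ToyModule(transformer=ToyWanTransformer(), vae=ToyModule())
print(pipeline.hf_to_vllm_mapper)       # top-level lookup finds nothing
print(collect_substr_remaps(pipeline))  # the walk finds the transformer's remap
```

Checking only the top-level object misses the remap, which matches the failure mode described here; walking the tree recovers it.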

Both calibrators share --weight-block-size 'M,N' for block-wise FP8, and the same fallback pattern: _force_export_quantized_weights + _patch_quant_config + hide_quantizers_from_state_dict — because ModelOpt's export_hf_checkpoint doesn't handle diffusers-video checkpoints natively.

Validation — HunyuanVideo-1.5 (1×H100 80GB, T2V 480×832, 33 frames, 30 steps, seed=42)

torch.compile enabled (default).

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Model load | 33.81 GiB | 28.74 GiB | −15% |
| Peak GPU memory (allocated) | 72.42 GiB | 67.36 GiB | −7% |
| Total wall time | 24.05 s | 20.79 s | −14% |
| Throughput | 1.44 it/s | 1.67 it/s | +16% |
| On-disk transformer weights | 31.02 GiB | 10.45 GiB | −66% |
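The Delta column can be re-derived from the two measurement columns. A small sketch, with the HunyuanVideo-1.5 numbers copied from the table and the delta taken as relative change against the BF16 baseline (the helper name is made up here):

```python
def pct_delta(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (BF16 baseline, ModelOpt FP8) pairs copied from the HunyuanVideo-1.5 table.
ROWS = {
    "Model load (GiB)": (33.81, 28.74),
    "Peak GPU memory (GiB)": (72.42, 67.36),
    "Total wall time (s)": (24.05, 20.79),
    "Throughput (it/s)": (1.44, 1.67),
    "On-disk transformer weights (GiB)": (31.02, 10.45),
}

for name, (bf16, fp8) in ROWS.items():
    print(f"{name}: {pct_delta(bf16, fp8):+.0f}%")
```

Rounded to whole percent, the results agree with the table: −15%, −7%, −14%, +16% and −66%.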

Engine signals confirming the path is wired correctly:

factory.py: Building quantization config: fp8 → Building quantization config: modelopt — auto-detect upgraded the user's --quantization fp8 flag to ModelOpt based on quant_algo: FP8 in transformer/config.json
data.py: Auto-detected quantization 'modelopt' from model config
__init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod — the ModelOpt FP8 kernel selected
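The auto-detect step reported by these log lines can be approximated as follows. This is a guess at the shape of the logic, not the code in factory.py or data.py: `resolve_quantization` is a hypothetical helper, and nesting `quant_algo` under a `quantization_config` key in transformer/config.json is an assumption.

```python
import json
import os


def resolve_quantization(model_dir, requested):
    """Upgrade a requested 'fp8' to 'modelopt' when the checkpoint says so.

    Assumed layout: <model_dir>/transformer/config.json holding a
    'quantization_config' dict with a 'quant_algo' entry.
    """
    config_path = os.path.join(model_dir, "transformer", "config.json")
    if not os.path.isfile(config_path):
        return requested
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    quant_cfg = config.get("quantization_config") or {}
    if requested == "fp8" and quant_cfg.get("quant_algo") == "FP8":
        return "modelopt"
    return requested
```

With this shape, a plain BF16 checkpoint keeps whatever the user requested, while a checkpoint that advertises `quant_algo: FP8` is routed to the ModelOpt path even though the flag was `--quantization fp8`.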

Visual comparison — HunyuanVideo-1.5

BF16 baseline:

hv15_bf16_compiled.mp4

ModelOpt FP8 (this PR):

hv15_modelopt_fp8_compiled.mp4

Same prompt ("A dog running across a field of golden wheat."), same seed, same sampling params. Output is visually equivalent to BF16: no detail collapse or composition drift like the online FP8 path showed in #2920.

Validation — Wan2.2 TI2V-5B (1×H100 80GB, T2V 704×1280, 49 frames, 30 steps, seed=42)

torch.compile enabled (default).

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Model load | 21.22 GiB | 16.68 GiB | −21% |
| Peak GPU memory (allocated) | 38.28 GiB | 33.74 GiB | −12% |
| Total wall time | 19.75 s | 16.63 s | −16% |
| Throughput | 1.96 it/s | 2.45 it/s | +25% |
| On-disk transformer weights | 21.22 GiB | 4.76 GiB | −77% |

Engine signals:

factory.py: Building quantization config: fp8 → modelopt (auto-detect fired)
data.py: Auto-detected quantization 'modelopt' from model config
__init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod
Zero unloaded weight_scale warnings after the hf_to_vllm_mapper fix for Wan2.2's ffn.net.0. → ffn.net_0. diffusers↔vllm-omni name remap.

Visual comparison — Wan2.2 TI2V-5B

BF16 baseline:

wan22_bf16_v4.mp4

ModelOpt FP8 (this PR):

wan22_modelopt_fp8_v4.mp4

Same prompt ("A dog running across a field of golden wheat."), same seed, same sampling params. Output is visually equivalent to BF16.

How to use

Pre-calibrated checkpoints are published on HF Hub so reviewers can test without recalibrating:

Option A: use the published checkpoints (no calibration needed)

# HunyuanVideo-1.5
python examples/offline_inference/text_to_video/text_to_video.py \
    --model shunyang90/HunyuanVideo-1.5-480p-ModelOpt-FP8 \
    --quantization fp8 \
    --prompt "A dog running across a field of golden wheat." \
    --height 480 --width 832 --num-frames 33 \
    --num-inference-steps 30 --seed 42 --guidance-scale 6.0 \
    --output outputs/hv15_modelopt_fp8.mp4

# Wan2.2 TI2V-5B
python examples/offline_inference/text_to_video/text_to_video.py \
    --model shunyang90/Wan2.2-TI2V-5B-ModelOpt-FP8 \
    --quantization fp8 \
    --prompt "A dog running across a field of golden wheat." \
    --height 704 --width 1280 --num-frames 49 \
    --num-inference-steps 30 --seed 42 --guidance-scale 5.0 \
    --output outputs/wan22_modelopt_fp8.mp4

Option B: calibrate from BF16 yourself (reproducibility / custom prompts)

# 1. Install
pip install 'nvidia-modelopt[all]'

# 2a. Calibrate HV-1.5 (~10–15 min on 1×H100)
python examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py \
    --model hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v \
    --output ./hv15-480p-modelopt-fp8 --overwrite

# 2b. Calibrate Wan2.2 TI2V-5B (~10 min on 1×H100)
python examples/quantization/quantize_wan2_2_modelopt_fp8.py \
    --model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --output ./wan22-ti2v-modelopt-fp8 --overwrite

# 3. (optional) Verify
python examples/quantization/check_modelopt_fp8_export.py --output ./hv15-480p-modelopt-fp8
python examples/quantization/check_modelopt_fp8_export.py --output ./wan22-ti2v-modelopt-fp8

# 4. Serve — auto-detect upgrades --quantization fp8 to ModelOpt FP8
# (same invocation as Option A, just pass the local output path as --model)

Test Plan

HunyuanVideo-1.5

Calibration script completes on 1×H100 — 648 weights converted to FP8
Checker reports quant_algo: FP8, 648 F8_E4M3 tensors, per-tensor scale granularity
On-disk transformer 10.45 GiB (−66% vs 31.02 GiB BF16)
Loads via #2913's adapter (Auto-detected quantization 'modelopt')
End-to-end inference produces valid video; visual parity with BF16
Model-load memory −15%, wall-clock −14% vs BF16

Wan2.2 TI2V-5B

Calibration script completes on 1×H100 — 300 weights converted to FP8
Checker reports quant_algo: FP8, 300 F8_E4M3 tensors, per-tensor scale granularity
Loads cleanly via #2913's adapter (after the hf_to_vllm_mapper fix; see the adapter change above)
End-to-end inference produces valid video; visual parity with BF16
Model-load memory −21%, wall-clock −16% vs BF16
On-disk transformer 4.76 GiB (−77% vs 21.22 GiB BF16)

Both

Pre-commit (ruff, format, typos) — passing
torch.compile enabled (default) on both BF16 and FP8 for fair comparison
HV-1.5 I2V variant + Wan2.2 I2V / T2V-A14B / VACE — wiring threaded, calibration untested

Known limitations

Per-block static FP8 is calibrated-correct but not serveable yet. Upstream vLLM's ModelOptFp8Config / ModelOptFp8LinearMethod only dispatches per-tensor scales — a block-wise checkpoint crashes at load with a shape-mismatch assertion in parameter.py:_assert_and_load. Per-tensor serving is the shippable path; --weight-block-size is kept in the calibrator for when upstream gains block-wise dispatch.
HV-1.5 and Wan2.2 aren't in ModelOpt's recognized-model registry, so QKV fusion is skipped and we hand-roll the weight-export path. This works, but fewer of ModelOpt's standard diffusion optimizations apply.
MHA quantizers (K/V/softmax) off by default: attention numerics on long video sequences were sensitive even with static scales (observed empirically in the #2920 ablation).

Follow-ups (still Phase 1, other video/variant coverage)

Wan2.2 T2V-A14B / I2V-A14B MoE variants (need 2×H100)
Wan2.2 VACE variant (wiring threaded; calibration helper needs VACE-specific prompts)
HunyuanVideo-1.5 720p + I2V variants
Block-wise static FP8 serving once upstream vLLM dispatches on strategy: block
Publish calibrated checkpoints to HF Hub under vllm-project-org/

Depends on #2913. References #2920 (online-FP8 ablation reference, will not merge).

cc @baonudesifeizhai @hsliuustc0106 @ArtificialRay

Signed-off-by: roG0d <rodgarcas98@gmail.com>

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>

…vllm-project#2920) Threads quant_config / prefix through HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales. Modulation, embeddings, proj_out stay raw nn.Linear (full precision). Signed-off-by: lishunyang <lishunyang12@163.com>

…eo-1.5 examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920). vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter). Signed-off-by: lishunyang <lishunyang12@163.com>

…ce_scale kwarg unsupported) HV-1.5's diffusers pipeline uses the new Guider abstraction (guider_config.json in the checkpoint) rather than a guidance_scale kwarg. Try setting it on the guider object once up front; in the per-prompt call, try with guidance_scale first and fall back without it on TypeError. Calibration only needs amax stats, so the exact CFG value isn't critical. Signed-off-by: lishunyang <lishunyang12@163.com>
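The try-then-fall-back call described in this commit message follows a common Python pattern. A minimal sketch with two fake pipelines (`legacy_pipe` and `guider_pipe` are hypothetical stand-ins, not diffusers classes):

```python
def call_with_optional_guidance(pipe, prompt, guidance_scale, **kwargs):
    """Call pipe with guidance_scale, retrying without it on TypeError."""
    try:
        return pipe(prompt=prompt, guidance_scale=guidance_scale, **kwargs)
    except TypeError:
        # Pipelines built around a guider object may reject the kwarg.
        return pipe(prompt=prompt, **kwargs)


def legacy_pipe(prompt, guidance_scale=1.0, num_inference_steps=10):
    """Fake pipeline that accepts guidance_scale."""
    return f"{prompt}|cfg={guidance_scale}|steps={num_inference_steps}"


def guider_pipe(prompt, num_inference_steps=10):
    """Fake pipeline that has no guidance_scale parameter."""
    return f"{prompt}|steps={num_inference_steps}"


print(call_with_optional_guidance(legacy_pipe, "dog", 6.0))
print(call_with_optional_guidance(guider_pipe, "dog", 6.0))
```

One caveat of this pattern: a TypeError raised inside the pipeline for an unrelated reason also triggers the retry, which is tolerable for calibration but would hide bugs in a serving path.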

Three checks: (A) transformer/config.json has sane quantization_config, (B) safetensors contain FP8 tensors, (C) optional disk-size delta vs BF16. Run after the quantize_*_modelopt_fp8.py scripts to spot issues before attempting to serve. Signed-off-by: lishunyang <lishunyang12@163.com>

…or view) torch's get_tensor() returns FP8 storage as bf16 views on some safetensors versions, giving false negatives. Read the on-disk dtype from the header directly — that's what actually determines whether the checkpoint is FP8. Signed-off-by: lishunyang <lishunyang12@163.com>

The default export_hf_checkpoint() doesn't actually serialize weights as FP8 for unknown model types like HunyuanVideo15Transformer3DModel: it saves BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug. Three changes:
- Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight per-module to convert in-memory tensors to actual FP8.
- Save the pipeline by hand (copy source minus transformer/, then save the quantized transformer with hide_quantizers_from_state_dict).
- Patch transformer/config.json to inject quant_algo: FP8 + config_groups so vllm-omni's adapter (vllm-project#2913) auto-detects it.
Signed-off-by: lishunyang <lishunyang12@163.com>

…, not pipeline Diffusers pipelines are ConfigMixin, not nn.Module — they don't have .named_modules(). Pass pipe.transformer directly. Signed-off-by: lishunyang <lishunyang12@163.com>

…ation fp8, not --stage-configs-path Signed-off-by: lishunyang <lishunyang12@163.com>

chatgpt-codex-connector · 2026-04-19T20:59:09Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

…block When --weight-block-size 'M,N' is given, override the weight quantizer with block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor instead of a scalar. Patched config_groups advertises strategy='block' + block_structure='MxN' so consumers know what to expect. Static FP8 is exempt from upstream vLLM's online block-wise gate, so this just works at serving time via vllm-project#2913's adapter. Default behavior unchanged (per-tensor) — pass --weight-block-size 128,128 to opt in. Signed-off-by: lishunyang <lishunyang12@163.com>
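To make the scale shapes concrete, the sketch below computes the expected `weight_scale` shape for a linear weight of shape `[out, in]`, with and without `--weight-block-size 'M,N'`. `scale_shape` is a hypothetical helper, and ceil division for dimensions that M or N do not divide evenly is an assumption; the commit message only states `(out//M, in//N)`.

```python
def scale_shape(out_features, in_features, block_size=None):
    """Shape of a weight_scale tensor for a [out, in] linear weight."""
    if block_size is None:
        return ()  # per-tensor: a single scalar scale
    m, n = block_size  # M tiles the output dim (-2), N the input dim (-1)
    # Ceil division is an assumption for sizes that M or N do not divide.
    return (-(-out_features // m), -(-in_features // n))


print(scale_shape(2048, 8192))              # per-tensor
print(scale_shape(2048, 8192, (128, 128)))  # block-wise
```

A loader that expects the scalar per-tensor shape cannot accept the 2-D block-wise shape, which is the kind of mismatch a shape assertion at load time would report.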

…s per-block) Reads shape info from safetensors header and classifies the checkpoint as per-tensor / per-channel / per-block based on whether weight_scale tensors are scalar, 1-D, or N-D. Helps verify --weight-block-size actually took effect (or if ModelOpt silently flattened to per-tensor). Signed-off-by: lishunyang <lishunyang12@163.com>

… granularity ModelOpt block-wise produces shapes like [16, 1, 16, 1] where size-1 dims are broadcasting axes. Classify by non-unity dim count: 0=per-tensor, 1=per-channel, 2+=per-block. Signed-off-by: lishunyang <lishunyang12@163.com>
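The classification rule stated here is small enough to write out directly. A sketch, with the function name chosen for illustration:

```python
def classify_scale_granularity(shape):
    """Classify a weight_scale shape by its number of non-unity dims."""
    non_unity = sum(1 for dim in shape if dim != 1)
    if non_unity == 0:
        return "per-tensor"
    if non_unity == 1:
        return "per-channel"
    return "per-block"


for shape in ([], [1], [3072, 1], [16, 1, 16, 1]):
    print(shape, classify_scale_granularity(shape))
```

Counting non-unity dims instead of total dims is what lets a broadcast-style shape such as `[16, 1, 16, 1]` register as per-block rather than being misread from its rank alone.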

…V-5B examples/quantization/quantize_wan2_2_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for Wan2.2 TI2V-5B (the dense 5B variant that fits 80GB BF16). Same design as the HunyuanVideo-1.5 calibrator (vllm-project#2924): force-export FP8 weights, patch quant_algo: FP8 into config.json, hide quantizers during save. Skips Wan2.2's precision-sensitive layers (condition_embedder, patch_embedding, proj_out, scale_shift_table, SP helpers). MHA quantizers off by default. vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Signed-off-by: lishunyang <lishunyang12@163.com>
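Skipping precision-sensitive layers usually comes down to a name filter. The sketch below builds glob patterns from the layer names listed in this commit message; the patterns themselves and the example layer names are illustrative assumptions, since the calibrator's actual filter is not shown here.

```python
from fnmatch import fnmatchcase

# Glob patterns built from the layer names listed in the commit message.
# The exact patterns used by the calibrator are not shown, so these are
# illustrative only.
SKIP_PATTERNS = (
    "*condition_embedder*",
    "*patch_embedding*",
    "*proj_out*",
    "*scale_shift_table*",
)


def should_quantize(layer_name, skip_patterns=SKIP_PATTERNS):
    """True when no skip pattern matches the fully qualified layer name."""
    return not any(fnmatchcase(layer_name, pat) for pat in skip_patterns)


# Hypothetical layer names in diffusers style.
LAYERS = [
    "blocks.0.attn1.to_q",
    "blocks.0.ffn.net.0.proj",
    "condition_embedder.time_embedder.linear_1",
    "patch_embedding",
    "proj_out",
]

quantized = [name for name in LAYERS if should_quantize(name)]
print(quantized)
```

Only the attention and FFN linears survive the filter in this toy list, which mirrors the intent of keeping embedders and the output projection in full precision.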

…ject#2920) Threads quant_config / prefix through WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines (T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding (Conv3d), time/text/image embedders, and proj_out stay full precision. All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here — that was an online-FP8 quality workaround; static calibration handles it. Signed-off-by: lishunyang <lishunyang12@163.com>

…n.net_0) Wan2.2 ModelOpt FP8 checkpoint has diffusers-style dotted FFN names (ffn.net.0.proj, ffn.net.2) but vllm-omni's WanFeedForward uses underscored names (ffn.net_0.proj, ffn.net_2). The transformer's load_weights remaps these for .weight tensors, but the ModelOpt adapter resolves scale tensor names independently via WeightsMapper and was missing the remap, so all 120 FFN scale tensors (30 blocks x 2 linears x 2 scales) silently fell through, leaving FP8 weights with no valid scales at serving time (visible as pure noise output).
Fix:
- Add hf_to_vllm_mapper class attribute on WanTransformer3DModel with the ffn remap.
- Extend ModelOptFp8CheckpointAdapter._get_weights_mapper to merge a model's hf_to_vllm_mapper (if present) into the resolution map. Models can now register arbitrary substring remaps via this standard vLLM attribute.
Signed-off-by: lishunyang <lishunyang12@163.com>
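The count of 120 and the effect of the remap can be reproduced with plain strings. In this sketch the `blocks.{i}` prefix, the scale suffixes `weight_scale` / `input_scale`, and the second remap entry for `ffn.net.2.` are assumptions inferred from the names quoted in this commit message; the helper names are made up.

```python
NUM_BLOCKS = 30
SCALE_SUFFIXES = ("weight_scale", "input_scale")  # assumed names of the 2 scales

# Scale tensor names as a diffusers-style checkpoint would spell them.
checkpoint_names = [
    f"blocks.{i}.ffn.{linear}.{suffix}"
    for i in range(NUM_BLOCKS)
    for linear in ("net.0.proj", "net.2")
    for suffix in SCALE_SUFFIXES
]

# Parameter names as the underscored module layout would expose them.
model_names = {
    f"blocks.{i}.ffn.{linear}.{suffix}"
    for i in range(NUM_BLOCKS)
    for linear in ("net_0.proj", "net_2")
    for suffix in SCALE_SUFFIXES
}

SUBSTR_REMAP = {"ffn.net.0.": "ffn.net_0.", "ffn.net.2.": "ffn.net_2."}


def remap(name, substr_map):
    """Apply every substring replacement in substr_map to name."""
    for old, new in substr_map.items():
        name = name.replace(old, new)
    return name


def count_unmatched(names, targets, substr_map):
    """Number of checkpoint names that still miss a model parameter."""
    return sum(1 for n in names if remap(n, substr_map) not in targets)


print(len(checkpoint_names))
print(count_unmatched(checkpoint_names, model_names, {}))
print(count_unmatched(checkpoint_names, model_names, SUBSTR_REMAP))
```

Without the remap every one of the 120 scale names misses its target; with it, none do.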

hsliuustc0106

This PR is substantial (>1000 LOC / >10 files). Could you please run the L3 tests locally and paste the results here?

Once L3 test results are available, I will proceed with a full review of the ModelOpt FP8 video-gen implementation.

Helps diagnose name-mismatch between checkpoint keys and model parameters (e.g. diffusers .ffn.net.0. vs vllm-omni .ffn.net_0.). Signed-off-by: lishunyang <lishunyang12@163.com>

…t FP8 adapter The adapter is instantiated with the whole Pipeline, not just the DiT. Only checking the top-level model means hf_to_vllm_mapper defined on a sub-module (e.g. WanTransformer3DModel inside Wan22TI2VPipeline) was invisible. Walk named_modules() and aggregate any mappers found. Signed-off-by: lishunyang <lishunyang12@163.com>

ArtificialRay · 2026-04-28T20:49:04Z

Hi, just want to double-check: is the throughput reported here calculated directly as num_inference_steps / wall_time? Does it cover the DiT model only, or does it include all encoders/decoders?

lishunyang12 · 2026-04-29T22:10:09Z

> Hi, just want to double-check: is the throughput reported here calculated directly as num_inference_steps / wall_time? Does it cover the DiT model only, or does it include all encoders/decoders?

The average throughput is taken from the progress bar (tqdm) during the denoising steps, so it does not include encoder/decoder processing time. To confirm, check where the tqdm bar starts and ends and whether the text encoder and VAE run in between.
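A quick arithmetic check on the HunyuanVideo-1.5 numbers is consistent with this answer. The sketch assumes the reported it/s is a plain average over the 30 denoising steps and that the total wall time contains the denoising loop; both are assumptions, and the helper names are made up.

```python
STEPS = 30  # --num-inference-steps used in both runs

# (total wall time in s, reported it/s) from the HunyuanVideo-1.5 table.
RUNS = {"BF16": (24.05, 1.44), "ModelOpt FP8": (20.79, 1.67)}


def steps_per_wall_second(steps, wall_s):
    """Throughput if the whole wall time were spent denoising."""
    return steps / wall_s


def implied_denoise_seconds(steps, its):
    """Loop duration implied by an average it/s over `steps` iterations."""
    return steps / its


for name, (wall_s, its) in RUNS.items():
    loop_s = implied_denoise_seconds(STEPS, its)
    print(
        f"{name}: steps/wall={steps_per_wall_second(STEPS, wall_s):.2f} it/s, "
        f"reported={its:.2f} it/s, implied loop={loop_s:.2f} s, "
        f"outside loop={wall_s - loop_s:.2f} s"
    )
```

For BF16, steps divided by wall time gives about 1.25 it/s, below the reported 1.44 it/s, so the reported figure cannot be steps over total wall time; the difference corresponds to a few seconds spent outside the denoising loop.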

* update quant wan2-2 modelopt to support A14B model
* update wan2-2 modelopt quant script
* update two gpu quantization quality script
* update i2v modelopt quant script
* update hunyuanvideo and wan2.2 vace modelopt script
* add quantization config parsing image2video script
* update vae-use-tiling for quantization quality script to avoid cuda oom for bf16 model
* update vae-use-tiling for quantization quality script to avoid cuda oom for fp8 model (for vae)
* update quantization quality script to support i2v videogen task
* fix modelopt fp8 quantization script and quality script in T2V
* update per-block quant
* update vace videogen script
* update quantization quality script to support model load and throughput calculation, and rewrite quant_quality script to automate model offline quant
* fix quantization quality script in hunyuanvideo1.5
* update modelopt check script
* update remote transmission to bench_quant_videogen
* update check_quant_videogen
* update bench quant videogen script
* update quality bench scripts to add negative prompt to wan2.2 I2V
* update quality bench script for wan2.2 i2v
* update quality bench script to add denoise throughput (s/it)
* quant_quality script update for image gen model
* del unrelated scripts
* del unrelated scripts
* update recommended test cmd after quantization for wan models
Signed-off-by: ArtificialRay <shuaiweihuang@163.com>
---------
Signed-off-by: ArtificialRay <shuaiweihuang@163.com>

…deo-1.5 video-gen
Rebuilt on current main and consolidates the two video-gen ModelOpt PRs (vllm-project#2924 FP8 + vllm-project#3305 NVFP4) into one. The inference side (ModelOpt FP8/NVFP4/mixed checkpoint adapters, generic get_checkpoint_adapter loader wiring, and quant_config threading + PP through the DiTs) already landed on main (vllm-project#2913, vllm-project#3570 and follow-ups), so those commits are dropped as redundant.
Net-new tooling that main does not have:
- examples/quantization/: offline ModelOpt FP8 + NVFP4 calibration for Wan2.2 (TI2V-5B + VACE) and HunyuanVideo-1.5, plus export verifier, activation variance diagnostic, and NVFP4 quant_config patch helper.
- stage_configs/{wan2_2_ti2v,hunyuan_video_15}_dit_fp8.yaml: DiT-only FP8 serve configs.
Scripts are self-contained (diffusers + modelopt); produced checkpoints load via main's existing ModelOpt adapter.
Signed-off-by: lishunyang12 <lishunyang12@163.com>

lishunyang12 · 2026-06-06T06:11:31Z

This will be covered by #3305

david6666666 · 2026-06-06T06:15:10Z

Superseded — consolidating the video-gen ModelOpt work into a single PR.

The inference side this PR carried (ModelOpt FP8 checkpoint adapter + diffusers loader wiring + DiT quant_config threading for HunyuanVideo-1.5 / Wan2.2) has since landed on main via #2913, #3570 and follow-ups, so it's now redundant.

The net-new pieces have moved:

Video-gen FP8 + NVFP4 calibration tooling (incl. Wan2.2 VACE) → [Quant] End-to-end ModelOpt FP8/NVFP4 for Wan2.2 & HunyuanVideo-1.5 video-gen #3305 (rebuilt on main; FP8 + NVFP4 in one PR).
Image-gen 2-GPU FP8 serve configs (Flux / Flux2-Klein / Qwen-Image / Z-Image) → [Quant] 2-GPU ModelOpt FP8 DiT serve configs for Flux / Flux2-Klein / Qwen-Image / Z-Image #4207.

Closing in favor of #3305 + #4207.

roG0d and others added 9 commits April 20, 2026 03:03

fix

d43aa6b

Signed-off-by: roG0d <rodgarcas98@gmail.com>

fix

9846e09

Signed-off-by: roG0d <rodgarcas98@gmail.com>

refactoring

12c8e5f

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>

refactoring

f79a574

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>

continue refacoring

e398ab3

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>

fix huawei

8a3b83d

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>

fix online server problem

b2b15f0

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>

lishunyang12 changed the title ~~[Quant] ModelOpt FP8 for HunyuanVideo-1.5 (Phase 2 of #2709)~~ [Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5 Apr 19, 2026

lishunyang12 added 6 commits April 20, 2026 03:50

[Quant] hide_quantizers_from_state_dict: pass transformer (nn.Module)…

d876c58

…, not pipeline Diffusers pipelines are ConfigMixin, not nn.Module — they don't have .named_modules(). Pass pipe.transformer directly. Signed-off-by: lishunyang <lishunyang12@163.com>

[Quant] Fix calibrator's 'next' hint: text_to_video.py uses --quantiz…

737db25

…ation fp8, not --stage-configs-path Signed-off-by: lishunyang <lishunyang12@163.com>

lishunyang12 marked this pull request as ready for review April 19, 2026 20:59

lishunyang12 requested a review from hsliuustc0106 as a code owner April 19, 2026 20:59

lishunyang12 changed the title ~~[Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5~~ [Quant] Phase 1 (video-gen): ModelOpt FP8 Apr 19, 2026

lishunyang12 added 2 commits April 20, 2026 05:24

lishunyang12 mentioned this pull request Apr 19, 2026

[Quant] Phase 1 (video-gen): ModelOpt FP8 for Wan2.2 TI2V-5B #2927

Closed

8 tasks

lishunyang12 added 2 commits April 20, 2026 05:34

lishunyang12 changed the title ~~[Quant] Phase 1 (video-gen): ModelOpt FP8~~ [Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5 and Wan2.2 Apr 19, 2026

hsliuustc0106 reviewed Apr 19, 2026

lishunyang12 added 2 commits April 20, 2026 06:16

[Quant] ModelOpt FP8 adapter: log first 3 skipped scales for diagnostics

a5fb789

Helps diagnose name-mismatch between checkpoint keys and model parameters (e.g. diffusers .ffn.net.0. vs vllm-omni .ffn.net_0.). Signed-off-by: lishunyang <lishunyang12@163.com>

ArtificialRay mentioned this pull request Apr 20, 2026

[Doc]: Is HunyuanVideo-1.5 really support fp8 dynamic quantization #2912

Open

1 task

lishunyang12 mentioned this pull request Apr 24, 2026

[RFC]: Continuous Quantization Support #1854

Open

lishunyang12 marked this pull request as draft April 26, 2026 04:46

ArtificialRay mentioned this pull request May 7, 2026

Phase1 (video-gen) ModelOpt FP8 Follow-ups lishunyang12/vllm-omni#57

Merged

18 tasks

ArtificialRay and others added 2 commits May 18, 2026 01:21

Merge upstream/main into modelopt-fp8-hv15

4114b98

david6666666 mentioned this pull request Jun 6, 2026

[Quant] End-to-end ModelOpt FP8/NVFP4 for Wan2.2 & HunyuanVideo-1.5 video-gen #3305

Draft

lishunyang12 closed this Jun 6, 2026

lishunyang12 wants to merge 25 commits into vllm-project:main from lishunyang12:modelopt-fp8-hv15

6 participants
